A multivariate visualisation of nutritional content of Starbucks Drinks
For this take-home exercise, I am required to apply appropriate data visualisation techniques to accomplish the following task:
Heatmap is chosen as the data set contains multiple nutrition indicators (multivariate), and it is possible that these indicators could be correlated to each other e.g. high sugar content usually leads to high calorie. A sketch of the proposed visualisation is shown below:
For this visualization, the following packages are used:
packages = c('tidyverse', 'dendextend', 'heatmaply', 'pheatmap', 'rmarkdown')
for (p in packages){
if(!require(p, character.only =T)){
install.packages(p)
}
}
Since the source file is in csv format, read_csv() of readr package is used to import the data.
starbucks <- read_csv("data/starbucks_drink.csv")
As we are only interested in the category “kids-drinks-and-others”, the data of this category is extracted using filter() of dyplyr package.
#extract data of kids-drinks-and-others
kids <- starbucks %>%
filter (`Category`== "kids-drinks-and-other")
Also, noticed that there are several variables that are not in their correct data type i.e. ‘Caffeine’ should be a numeric and not character; ‘Size’, ‘Milk’ and ’ Whipped Cream’ should be converted from character type to a factor so that the data will be properly ordered subsequently. The functions as.numeric() and as.factor() are used to convert the variables.
kids$`Caffeine(mg)`<-as.numeric(kids$`Caffeine(mg)`)
kids$`Size`<-as.factor(kids$`Size`)
kids$`Milk`<-as.factor(kids$`Milk`)
kids$`Whipped Cream`<-as.factor(kids$`Whipped Cream`)
glimpse(kids)
Rows: 262
Columns: 18
$ Category <chr> "kids-drinks-and-other", "kids-drink~
$ Name <chr> "Cinnamon Dolce Crème", "Cinnamon Do~
$ `Portion(fl oz)` <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, ~
$ Calories <dbl> 140, 210, 170, 230, 170, 230, 240, 3~
$ `Calories from fat` <dbl> 45, 100, 50, 110, 0, 60, 80, 140, 50~
$ `Total Fat(g)` <dbl> 5.0, 11.0, 6.0, 12.0, 0.0, 6.0, 9.0,~
$ `Saturated fat(g)` <dbl> 0.0, 4.0, 5.0, 9.0, 0.0, 4.0, 5.0, 9~
$ `Trans fat(g)` <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0~
$ `Cholesterol(mg)` <dbl> 0, 20, 0, 20, 5, 25, 30, 45, 25, 40,~
$ `Sodium(mg)` <dbl> 120, 125, 130, 135, 120, 125, 120, 1~
$ `Total Carbohydrate(g)` <dbl> 25, 27, 28, 30, 32, 34, 32, 34, 32, ~
$ `Dietary Fiber(g)` <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, ~
$ `Sugars(g)` <dbl> 22, 25, 26, 28, 31, 33, 31, 33, 31, ~
$ `Protein(g)` <dbl> 2, 2, 1, 1, 10, 10, 9, 9, 9, 10, 8, ~
$ `Caffeine(mg)` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
$ Size <fct> Tall, Tall, Tall, Tall, Tall, Tall, ~
$ Milk <fct> Almond, Almond, Coconut, Coconut, No~
$ `Whipped Cream` <fct> No Whipped Cream, Whipped Cream, No ~
There is a variable, “Portion(fl oz)” which represents the volume of drink, and is similar to another variable “Size”. To avoid having too many categories which will affect the visualisation and for a fairer comparison, data transformation is required to convert the values of the various nutrition indicators to be based on per fl oz (fluid ounce) instead e.g. total fat per fl oz. The group_by() and summarise() functions of dplyr package is used to perform the data transformation.
kids_grp <- kids %>%
dplyr::group_by(`Name`, `Whipped Cream`, `Milk`) %>%
summarise(`Calories`= sum(`Calories`)/sum(`Portion(fl oz)`),
`Calories from fat` = sum(`Calories from fat`)/sum(`Portion(fl oz)`),
`Total Fat(g)` = sum(`Total Fat(g)`)/sum(`Portion(fl oz)`),
`Saturated fat(g)` = sum(`Saturated fat(g)`)/sum(`Portion(fl oz)`),
`Trans fat(g)` = sum(`Trans fat(g)`)/sum(`Portion(fl oz)`),
`Cholesterol(mg)` = sum(`Cholesterol(mg)`)/sum(`Portion(fl oz)`),
`Sodium(mg)` = sum(`Sodium(mg)`)/sum(`Portion(fl oz)`),
`Total Carbohydrate(g)` = sum(`Total Carbohydrate(g)`)/sum(`Portion(fl oz)`),
`Dietary Fiber(g)` = sum(`Dietary Fiber(g)`)/sum(`Portion(fl oz)`),
`Sugars(g)` = sum(`Sugars(g)`)/sum(`Portion(fl oz)`),
`Protein(g)` = sum(`Protein(g)`)/sum(`Portion(fl oz)`),
`Caffeine(mg)` = sum(`Caffeine(mg)`)/sum(`Portion(fl oz)`)) %>%
ungroup()
The data has to be a data matrix to create the heatmap, and every row name must be unique. The following code chunk below uses unite() from dplyr package to create a new variable as the row name, and data.matrix() from base R to transform the data frame into a data matrix.
#Create new variable from the existing categorical variables
kids_grp <- kids_grp %>%
unite("Type", c(`Name`,`Milk`,`Whipped Cream`), sep = '-', remove = FALSE)
#Select the relevant variables
kg <- select(kids_grp,c(5:16))
#Change the rows by the drink type instead of row number
row.names(kg) <- kids_grp$Type
#Convert data frame to a data matrix
k_matrix <- data.matrix(kg)
In order to determine the best clustering method for the heatmap and number of cluster the dend_expend() and find_k() functions of dendextend package will be used.
kd <- dist(normalize(k_matrix, method = "euclidean"))
dend_expend(kd)[[3]]
dist_methods hclust_methods optim
1 unknown ward.D 0.5614832
2 unknown ward.D2 0.6088735
3 unknown single 0.6646756
4 unknown complete 0.6243221
5 unknown average 0.7387914
6 unknown mcquitty 0.6958625
7 unknown median 0.5369151
8 unknown centroid 0.6061457
The output table shows that “average” method should be used because it gave the high optimum value. Next, find_k() is used to determine the optimal number of cluster.
Although 10 is the recommended number of clusters, from the graph above, k = 3 is sufficient since the incremental value of the average silhouette compared to k = 10 is very small.
Using heatmaply() from the heatmaply package, the following code chunk is used to create the final visualisation. Some trial and error is performed to determine the optimal figure height and font size.
The normalising method is chosen as the data contains many variables of different scale and of different distributions.
Since this is an interactive chart, readers can also zoom in and focus on a specific area of interest.
#heatmap created with normalised data matrix
heatmaply(normalize(k_matrix),
Colv=NA,
seriate = "none",
colors = Greens,
#no. of clusters is 3
k_row = 3,
margins = c(NA,150,30,NA),
fontsize_row = 5,
fontsize_col = 6,
main = "Starbucks' Kids & Others Drinks Nutritional Content",
xlab = "Nutrition indicators (per fl oz)",
ylab = "Type of Drinks",
#to have a white grid to help identify different cells
grid_gap = 1
)
While the interactive heatmap created using heatmaply allows readers to zoom in and out to focus on a specific area and reads the value via the tooltip, it does not offer an easy way to visualise observations if there are more than one categorical variable. Hence pheatmap package is used to add annotations to differentiate the observations further based on a chosen category.
The following code chunk is used to create a static heatmap using pheatmap(), with “Whipped Cream” as the annotation target.
#filter relevant variables to create the data matrix
kg_p <- select(kids_grp,c(3:16))
#Change the rows by the drink type instead of row number
row.names(kg_p) <- kids_grp$Type
#Convert data frame to a data matrix
k_matrix_p <- data.matrix(kg_p)
k_matrix_p2 <- k_matrix_p[,3:14]
#create an independent data frame comprising of "Whipped Cream" status
k_row <- data.frame("Whipped Cream" = kg_p$`Whipped Cream`)
#ensure that the row names of the annotation data frame matches the row names of the heatmap matrix
rownames(k_row) <- rownames(kg_p)
#using pheatmap to create the heatmap
pheatmap(normalize(k_matrix_p2),
#annotate the rows using the earlier defined data frame
annotation_row = k_row,
clustering_method = "average",
clustering_distance_cols = "euclidean",
cluster_cols = FALSE,
margins = c(NA,150,30,NA),
fontsize_row = 5,
fontsize_col = 7,
#cut the map into 3 clusters
cutree_rows = 3,
color = colorRampPalette(c("white", "dark green"))(60),
main ="Starbucks Kids & Others Drinks \n (Categorised by Whipped Cream)"
)
Based on the above heatmap created by pheatmap, it is clear that one cluster is formed by drinks with whipped cream, and the nutritional indicators of drinks under this category have high value.